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Abstract. Finding an unconstrained and statistically interpretable re- 
parameterization of a covariance matrix is still an open problem in 
statistics. Its solution is of central importance in covariance estimation, 
particularly in the recent high-dimensional data environment where 
enforcing the positive-definiteness constraint could be computationally 
expensive. We provide a survey of the progress made in modeling covari- 
ance matrices from two relatively complementary perspectives: (1) gen- 
eralized linear models (GLM) or parsimony and use of covariates in 
low dimensions, and (2) regularization or sparsity for high-dimensional 
data. An emerging, unifying and powerful trend in both perspectives is 
that of reducing a covariance estimation problem to that of estimating 
a sequence of regression problems. We point out several instances of 
the regression-based formulation. A notable case is in sparse estima- 
tion of a precision matrix or a Gaussian graphical model leading to 
the fast graphical LASSO algorithm. Some advantages and limitations 
of the regression-based Cholesky decomposition relative to the classical 
spectral (eigenvalue) and variance-correlation decompositions are high- 
lighted. The former provides an unconstrained and statistically inter- 
pretable reparameterization, and guarantees the positive-definiteness 
of the estimated covariance matrix. It reduces the unintuitive task of 
covariance estimation to that of modeling a sequence of regressions at 
the cost of imposing an a priori order among the variables. Element- 
wise regularization of the sample covariance matrix such as banding, 
tapering and thresholding has desirable asymptotic properties and the 
sparse estimated covariance matrix is positive definite with probability 
tending to one for large samples and dimensions. 
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1. INTRODUCTION 

The p X p covariance matrix S of a random vec- 
tor Y = {yi, ■ ■ ■ tVpY with as many as ^^^^^^ con- 
strained parameters plays a central role in virtu- 
ally all of classical multivariate statistics (Ander- 
son, 2003), time series analysis (Box, Jenkins and 
Reinsel, 1994), spatial data analysis (Cressie, 1993), 
variance components and longitudinal data analy- 
sis (Searle, Casella and McCulloch, 1992; Diggle et 
al., 2002), and in the modern and rapidly growing 
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area of statistical and machine learning dealing with 
massive and high-dimensional data (Hastie, Tibshi- 
rani and Friedman, 2009). It is generally recognized 
that the two major challenges in covariance estima- 
tion are the positive-definiteness constraint and the 
high-dimensionality where the number of parame- 
ters grows quadratically in p. In this survey, we point 
out that these challenges become manageable, for 
example, by reducing covariance estimation to that 
of solving a series of (penalized) least squares regres- 
sion problems. 

Nowadays, in microarray data, spectroscopy, finan- 
ce, climate studies and abundance data in commu- 
nity ecology it is common to have situations where 
p ^ n. Here the use of a sample covariance ma- 
trix is problematic (Stein, 1956), particularly when 
its inverse is needed as, for example, in classifica- 
tion procedures (Anderson, 2003, Chapter 6), multi- 
variate linear regression (Warton, 2008; Witten and 
Tibshirani, 2009), portfolio selection (Ledoit, Santa- 
Clara and Wolf, 2003) and Gaussian graphical mod- 
els (Wong, Carter and Kohn, 2003; Meinshausen and 
Biihlmann, 2006; Yuan and Lin, 2007). In these sit- 
uations and others, it is desirable to find alternative 
covariance estimators that are more accurate and 
better-conditioned than the sample covariance ma- 
trix. 

It was noted rather early by Stein (1956, 1975) 
that the sample covariance matrix S = ^ SiLi 
based on a sample of size n from a mean zero normal 
population with the covariance matrix S, though 
unbiased and positive definite, is a poor estimator 
when ^ is large (Johnstone, 2001). It distorts the 
eigenstructure of S, in the sense that the largest 
(smallest) sample eigenvalue will be biased upward 
(downward). Since then many improved estimators 
have been proposed by shrinking the eigenvalues 
of S toward a central value (Half, 1980, 1991; Lin 
and Perlman, 1985; Dey and Srinivasan, 1985; Yang 
and Berger, 1994; Ledoit and Wolf, 2004). These 
have been derived from a decision-theoretic perspec- 
tive or by specifying an appropriate prior for the co- 
variance matrix. The Stein's family of shrinkage esti- 
mators leaving intact the eigenvectors of the sample 
covariance matrix are neither sparse nor parsimo- 
nious. However, lately the search for sparsity and 
parsimony has led to either shrinking the matrix S 
itself toward certain targets like diagonal and au- 
toregressive structures as in Daniels and Kass (1999, 
2001), or shrinking its eigenvectors as in Hoff (2009) 
and Johnstone and Lu (2009). 



In many applications the need for the precision 
matrix is stronger than that for S itself. Though 
the former can be computed from the latter in 0{p'^) 
operations, this could be computationally expen- 
sive and should be avoided when p is large. The 
regression-based approach of Meinshausen and Biihl- 
mann (2006) provides a sparse estimate of the preci- 
sion matrix or a Gaussian graphical model by fitting 
separate LASSO regression to each variable, using 
the others as predictors. This simple idea has in- 
spired several direct and improved sparse estimators 
of using a penalized likelihood approach with 
a LASSO penalty on its off-diagonal terms (Yuan 
and Lin, 2007; Banerjee, El Ghaoui and d'Aspremont, 
2008; Friedman, Hastie and Tibshirani, 2008; Roth- 
man et al., 2008; Rocha, Zhao and Yu, 2008; Peng et 
al., 2009). Friedman, Hastie and Tibshirani (2008) 
graphical LASSO is the fastest available algorithm 
to date. Surprisingly, such a sparse covariance esti- 
mator is guaranteed to be positive definite (Baner- 
jee, El Ghaoui and d'Aspremont, 2008). 

A remarkable unifying regression-based theme has 
emerged from research on covariance estimation in 
the last decade or so. Some notable examples are as 
follows: (i) formulating principal component analy- 
sis (PCA) as regression optimization problems (Jong 
and Kotz, 1999; Zou, Hastie and Tibshirani, 2006), 
sparse loadings are then estimated by imposing the 
lasso constraint on the regression coefficients, (ii) 
regression-based derivation and interpretation of the 
modified Cholesky decomposition of a covariance 
matrix and its inverse (Pourahmadi, 1999, 2001, Sec- 
tion 3.5; Bilmes, 2000; Huang et al., 2006; Rothman, 
Levina and Zhu, 2010), (iii) the regression approach 
of Meinshausen and Biihlmann (2006) to the Gaus- 
sian graphical models, (iv) the graphical LASSO al- 
gorithm of Friedman, Hastie and Tibshirani (2008, 
2010) and (v) the iteratively reweighted penalized 
likelihood of Fan, Feng and Wu (2009) where non- 
concave penalties such as the smoothly clipped ab- 
solute deviation (SCAD) are imposed on the entries 
of the precision matrix. The problem of sparse es- 
timation of the precision matrix is then recast as 
a sequence of penalized likelihood problems with 
a weighted LASSO penalty and solved using the 
graphical LASSO algorithm of Friedman, Hastie and 
Tibshirani (2008). 

Among these approaches it seems only (ii) has the 
expressed goal of providing unconstrained and sta- 
tistically interpretable regression parameters for the 
covariance (precision) matrix. Unfortunately, how- 
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ever, unlike the others which work for unordered 
variables and provide permutation-invariant covari- 
ance estimators, (ii) and a few other alternatives 
to the sample covariance matrix proposed in recent 
years give rise to covariance estimators which are 
sensitive to the order among the variables in Y. 
These approaches are suitable for time series and 
longitudinal data which have a natural (time) order 
among the variables in Y , and assume that variables 
far apart in the ordering are less correlated. For ex- 
ample, regularizing a covariance matrix by taper- 
ing (Furrer and Bengtsson, 2007), banding (Bickel 
and Levina, 2004, 2008a; Wu and Pourahmadi, 2003, 

2009) and generally those based on the Cholesky de- 
composition of the covariance matrix or its inverse 
(Pourahmadi, 1999, 2000; Rothman, Levina and Zhu, 

2010) do impose an order among the components 
of Y and are not permutation-invariant. The idea of 
thresholding individual entries of S has been used in 
the estimation of large covariance matrices by Bickel 
and Levina (2008b), El Karoui (2008a, 2008b) and 
Rothman, Levina and Zhu (2009). Such estimators 
are permutation-invariant with desirable asymptotic 
properties. 

It should be noted that the recent surge of interest 
in regression-based approaches to sparsity in high- 
dimensional data bodes well with the long history 
of interest in parsimony and using covariates when 
modeling covariance matrices of low-dimensional da- 
ta (Anderson, 1973). For example, longitudinal data 
collected from expensive clinical trials and biologi- 
cal experiments may have about n = 30 subjects and 
p < 10 measurements per subject. Parsimonious and 
accurate modeling of the covariance structure is im- 
portant in these application areas (Cannon et al., 
2001; Carroh, 2003; Fitzmaurice et al., 2009). How- 
ever, the area of data-based covariance modeling is 
woefully underdeveloped. At present, a practitioner 
has the option of picking a structured covariance 
matrix from a long menu, where at one extreme the 
choice is a^Ip (independence) and at the other the 

unstructured covariance matrix with ^^^^^ param- 
eters (Zimmerman and Niihez-Anton, 2001, 2010). 
Of course, it is desirable to bridge the gap between 
these two extremes and develop a bona fide GLM 
methodology and a data-based framework for mod- 
eling covariance matrices. Attempts to develop such 
methods going beyond the traditional linear covari- 
ance models (Anderson, 1973) have been made in 
recent years by Chiu, Leonard and Tsui (1996) and 
Pourahmadi (1999, 2000); Pan and MacKenzie 



(2003); Lin and Wang (2009); Leng, Zhang and Pan 
(2010); Lin (2011) using the spectral and Cholesky 
decompositions of covariance matrices, respectively. 

Given the complex nature of the positive-definite- 
ness constraint, in developing a GLM methodolgy it 
is plausible to factorize S into two or more compo- 
nents capturing the "variance" through a diagonal 
matrix and the "dependence" through a matrix with 
functionally unrelated entries. A decomposi- 
tion is ideal for the GLM purposes, if its "depen- 
dence" component is an unconstrained and statis- 
tically interpretable matrix. The three most com- 
monly used decompositions in increasing order of 
adherence to the GLM principles are the variance- 
correlation, spectral and Cholesky decompositions 
where their "dependence" components are correla- 
tion, orthogonal and lower triangular matrices, re- 
spectively. While the entries of the first two ma- 
trices are always constrained, those of the last are 
unconstrained. Interestingly, these three decomposi- 
tions are subsumed (Zimmerman and Niihez-Anton, 
2001, page 59) by a decomposition from the class of 
factor/mixed models (Anderson, 2003): 

(1) S = ZBZ' + W. 

Here, the matrix Z is p x q with q standing for the 
number of latent factors, B and W are q x q and 
p X p unknown preferably diagonal matrices. The 
representation (1) is valid only when each of the p 
variables are well-approximated as linear combina- 
tions of the same latent factors plus an independent 
error. In principle, this may occur when q is large, 
and adding W to the reduced rank decomposition 
ensures the positive-definiteness of S. Technical dif- 
ficulties with the use of (1) can be resolved to various 
extents by choosing the components of the quadru- 
ple {q,W, B, Z) close to the ideal values of q = p, 
W = 0, B diagonal and Z sparse or structured. 

The outline of the paper is as follows. Section 2 
covers some preliminaries on the GLM for covari- 
ance matrices, the three standard decompositions 
of a covariance matrix, a regression-based decom- 
position of the precision matrix useful in Gaussian 
graphical models, a review of covariance estimation 
from the GLM perspective and its evolution through 
linear/inverse, log and hybrid link functions. Steinian 
shrinkage, regularization (banding, tapering and 
thresholding), penalized likelihood estimation and 
improvement of the sample covariance matrix for 
high-dimensional data are discussed in Section 3. 
Some prior distributions on the parameters of the 
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factors of the three decompositions and their roles 
in the Bayesian inference are reviewed in Section 4. 
Section 5 concludes the paper. 

This survey emphasizes the importance of regres- 
sion-based idea and hence the need for unconstrained 
reparameterization in both the GLM- and regulariza- 
tion-type approaches to covariance estimation for 
low- and high-dimensional data. As such, it has a rel- 
atively narrow focus; important topics like robust- 
ness, use of random-effects models, nonparametric 
and semi-parametric methods in covariance estima- 
tion are not discussed. It is hoped to serve as a guide 
or a blueprint for further research in this active and 
growing area of current interest in statistics. 

2. THE GLM AND MATRIX 
DECOMPOSITIONS 

In this section the importance of the GLM, the 
role of the three matrix decompositions in remov- 
ing the positive-definiteness constraint on a covari- 
ance matrix, the connection between reparameteriz- 
ing the precision matrix and the Gaussian graphical 
models, along with linear, log-linear and generalized 
linear models for covariance matrices are reviewed. 

2.1 Positive-Definiteness and the GLM 

A major stumbling block in covariance estimation, 
particularly when using covariates, is the notorious 
positive-definiteness constraint. Since a covariance 
matrix defined by S = E{Y — ^){Y — //)', is a mean- 
like parameter, it is natural to exploit the idea of 
GLM to develop a systematic, data-based statisti- 
cal model- fitting procedure for covariance matrices. 
However, unlike the mean vector where a link func- 
tion acts elementwise, for covariance matrices el- 
ementwise transformations are not enough, as the 
positive-definiteness is a simultaneous constraint on 
all its entries. More global transformations engag- 
ing possibly all entries of a covariance matrix are 
needed to remove the constraint. 

Thus, the GLM approach to covariance estima- 
tion hinges on finding link functions that induce un- 
constrained and statistically interpretable reparam- 
eterization. Not surprisingly, most common and suc- 
cessful modeling approaches decompose a covariance 
matrix into its "variance" and "dependence" compo- 
nents, and write regression models using covariates 
for the logarithm of the "variances." However, writ- 
ing such regression models for the entries of the "de- 
pendence" component is still a challenging problem 
because these are often constrained. In the next sec- 
tion examples of unconstrained parameterizations 



of a covariance matrix are given which involve the 
variance-correlation, spectral and Cholesky decom- 
positions. 

2.2 The Matrix Decompositions 

In this section we present the roles of the variance- 
correlation, spectral and Cholesky decompositions 
in potentially removing the positive-definiteness con- 
straint on a covariance matrix, and paving the way 
for using covariates to reduce its high number of pa- 
rameters. 

2.2.1 The variance- correlation decomposition The 
simple decomposition S = DRD, where D is the di- 
agonal matrix of standard deviations and R = (pij) 
is the correlation matrix of Y, has a strong practi- 
cal appeal since both factors are easily interpreted 
in terms of the original variables. It allows one to 
estimate D and R separately, which is important in 
situations where one factor is more important than 
the other (Lin and Perlman, 1985; Liang and Zeger, 
1986; Barnard, McCulloch and Meng 2000). 

Note that while the logarithm of the diagonal en- 
tries of D are unconstrained, the correlation ma- 
trix R must be positive definite with the additional 
constraint that all its diagonal entries are equal to 1 . 
Thus, it is inconvenient to work with it in the frame- 
work of GLM and to reduce its large number of pa- 
rameters. In the literature of longitudinal data anal- 
ysis (Liang and Zeger, 1986; Diggle et al., 2002; Zim- 
merman and Niihez- Anton, 2010) and other applica- 
tion areas dealing with correlated data, in the inter- 
est of expediency, parsimony and ensuring positive- 
definiteness structured correlation matrices with 
a few parameters are preferred. Fan, Huang and 
Li (2007) have studied a semiparametric model for 
a covariance structure by estimating the marginal 
variances via kernel smoothing and used specific pa- 
rametric models for the correlation matrix such as 
the ARMA(1,1). 

2.2.2 Decomposition of the precision matrix: Gaus- 
sian graphical models Recall that the marginal (pair- 
wise) dependence among the entries of a random 
vector is captured by the off-diagonal entries of S 
or the entries of the correlation matrix R = (pij). 
However, the conditional dependencies can be found 
in the off-diagonal entries of the precision matrix 

= (o"*-'). More precisely, for Y a mean zero nor- 
mal random vector with a positive-definite covari- 
ance matrix, if the ij'th component of the precision 
matrix is zero, then the variables yi and yj are con- 
ditionally independent, given the other variables. 
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Conditional independence structure in Y is often 
shown as a graphical model with the nodes corre- 
sponding to variables and the absence of edges in- 
dicating conditional independence (Anderson, 2003, 
Chapter 15). 

In this section we give several regression interpre- 
tations of the entries of the variance-correlation de- 
composition of the precision matrix: 

Most of these are motivated by the recent surge of 
activities in sparse estimation of in the context 
of Gaussian graphical models sparked by the ap- 
proach in Meinshausen and Biihlmann (2006) based 
on solving p separate LASSO regression problems. 
We show that the entries of (i?, D) have direct sta- 
tistical interpretations in terms of the partial corre- 
lations, and variance of predicting a variable given 
the rest. More precisely, standard regression calcula- 
tions show that the partial correlation coefficient be- 
tween Ui and Uj after removing the linear effect of the 
p — 2 remaining variables is given by pij = — , . . , 

and that df^, the partial variance of yi after remov- 
ing the linear effect of the remaining p — 1 variables, 
is given by i-. 

For this and other regression-based techniques re- 
viewed in this survey, it is instructive to partition 
a random vector Y into two components (y/,1^')' of 
dimensions pi and p2- Similarly, its covariance and 
precision matrices will be partitioned conformally as 

■5.11 5.12- 
5.21 5.22 



Sii S12 
S21 S22 



Some useful relationships among the blocks of S 
and are obtained by considering the linear least- 
squares regression (prediction) of Y2 based on Yi. 
Let the p2 x pi matrix $2|i be the regression coeffi- 
cients matrix and the vector of regression residuals 
be denoted by Y2.1 = I2 - *^*2|i^i- Recall that ^>2|i 
and the corresponding prediction error covariance 
matrix can be found by requiring that the vector of 
residuals Y2.1 be uncorrelated with Yi. Thus, 



(2) 
and 

(3) 



2|1 



21 



S91 S 



Cov(y2-i) = S22 - S2ii;i/Si2 

V /v22^-l 



-'22-1 



(S^ 



Certain special choices of I2 corresponding to p2 = 
1,2 are helpful in connecting $2|i)^22 i directly to 



the entries of the precision matrix as we discuss 
below. 

First, when p2 = 1, ^2 = for a fixed i, and Yi = 
(yi, . . . , yi-i,yi+i, ...,yp)' = , then E22.1 is a sca- 
lar, called the partial variance of yi given the rest. 
Let yi be the linear least-squares predictor of yi 
based on the rest Yl(i) , and Ei = yi — jji, of = Var(ej) 
be its prediction error and prediction error variance, 
respectively. Then, 



(4) 



and it follows immediately from (2) and (3) that the 
regression coefficients of yi on YL^j), are given by 



(5) 
and 

(6) 



a- 



a' 



di = yar{yi\yj,j / i) = 



L 



This shows that o"*-', the {i,j) entry of the precision 
matrix, is, up to a scalar, the regression coefficient of 
variable j in the multiple regression of variable i on 
the rest. As such, each Pij is an unconstrained real 
number, note that fijj = and fiij is not symmetric 
in {i,j). 

Writing (5) in matrix form gives another useful 
factorization of the precision matrix: 



(7) 



where D is a diagonal matrix with dj as its jth di- 
agonal entry, and B is a p x p matrix with zeros 
along its diagonal and f3j^k in the {j, k)th. position. 
Now, it is evident from (7) that the sparsity pat- 
terns of and B are the same, and, hence, the 
former can be inferred from the latter using the re- 
gression setup (4). This is the key conceptual tool 
behind the approach of Meinshausen and Biihlmann 
(2006). Note that the left-hand side of (7) is a sym- 
metric matrix while the right-hand side is not neces- 
sarily so. Thus, one must impose the following sym- 
metry constraint (Rocha, Zhao and Yu 2008; Fried- 
man, Hastie and Tibshirani, 2010) for j,k = 1, . . . ,p: 



(8) 



>jk 



d^,l3, 



'j Pkj ■ 



As another important example, take p2 = 2, Y2 = 
{yi, yj), ij^ j and Yi = Y_(^ij-^ comprising the remain- 
ing p — 2 variables. Then, it follows from (3) that the 
covariance matrix between yi,yj, after eliminating 
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the linear effects of the other p — 2 components, is 
given by 

-1 



^22-1 



where A 



a 



a 



a- 



a- 



■33 



(T- 



■33 



-a 



-a 
a' 



^^rr^3 



a a- 



(o"*-')^. The correlation coeffi- 
cient in S22-1 is, indeed, the partial correlation coef- 
ficient between and yj-. 



(9) 



Pij 



a 



as announced earlier. Moreover, from (5) and (9) it 
follows that 



(10) 



A 



a33 



This representation which shows that S ^ and R 
share the same sparsity pattern is the basis for the 
Peng, Zhou and Zhu (2009) SPACE algorithm which 
imposes a LASSO penalty on the off-diagonal en- 
tries of the matrix of partial correlations R; see also 
Friedman, Hastie and Tibshirani (2010). 

2.2.3 The spectral decomposition The spectral de- 
composition of a covariance matrix given by 



(11) 



S = PAP' = ^X 



i=l 



where A is a diagonal matrix of eigenvalues and P 
the orthogonal matrix of normalized eigenvectors 
with the Cj as its ith column, is familiar from the lit- 
erature of principal component analysis (Anderson, 
2003; Flury, 1988). The entries of A and P have 
interpretations as variances and coefficients of the 
principal components. The matrix P being orthogo- 
nal is constrained, so that it is inconvenient to work 
with it in the framework of GLM or to use covariates 
to reduce its high number of parameters. 

In spite of the severe constraint on the orthogo- 
nal matrix, the spectral decomposition is the source 
of a new unconstrained reparameterization due to 
Leonard and Hsu (1992) and Chiu, Leonard and 
Tsui (1996). They observed that the logarithm of 
a covariance matrix S defined by 



(12) logs = PlogAP' = (log 



is an unconstrained symmetric matrix. However, 
a drawback of this transformation (link function) 
seems to be the lack of statistical interpretability 



of the entries of logS (Brown, Le and Zidek, 1994; 
Liechty, Liechty and Miiller, 2004). From (11) and 

(12) it is evident that the entries of S and logS are 
similar functions of the entries of P and A, except 
that in (12) Aj is replaced by log Aj. Can this "small" 
substitution be the reason for the "big" difference 
in the statistical interpretability of the entries of log 
of a covariance matrix and the matrix itself? This 
case is interesting as it points out to a sort of trade- 
off that exists between the requirements of uncon- 
strained reparameterization of covariance matrices 
and statistical interpretability of the new parame- 
ters. 

2.2.4 The Cholesky decompositions The standard 
Cholesky decomposition of a positive-definite ma- 
trix encountered in some optimization techniques, 
software packages and matrix computation (Golub 
and Van Loan, 1989) is of the form 

(13) S = CC", 

where C = {cij) is a unique lower-triangular matrix 
with positive diagonal entries. Statistical interpreta- 
tion of the entries of C is difficult in its present form 
(Pinheiro and Bates, 1996). However, reducing C to 
unit lower-triangular matrices through multiplica- 
tion by the inverse oi D = diag(cii, . . . , Cpp) makes 
the task of statistical interpretation of the diagonal 
entries of C and the ensuing unit lower-triangular 
matrix much easier. 

For example, using basic matrix multiplication, 
(13) can be rewritten as 



(14) 



CD^^DDD-'C = LD^L 



-i2 tI 



where L = CD~^ is obtained from C by dividing 
the entries of its ith column by ca. This is usually 
called the modified Cholesky decomposition of S; it 
can also be written in the forms 



(15) 



where T = . Note that the second identity is, 
in fact, the modified Cholesky decomposition of the 
precision matrix and the first identity in (15) 
looks a lot like the spectral decomposition, in that S 
is diagonalized by a lower triangular matrix. How- 
ever, we show that unlike the constrained entries 
of the orthogonal matrix of the spectral decomposi- 
tion, the nonredundant entries of T = are un- 
constrained and statistically meaningful. Further- 
more, the argument makes it clear that the parame- 
ters in the factors of the Cholesky decomposition are 
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dependent on the order in which the variables ap- 
pear in the random vector Y. Wagaman and Levina 
(2009) have proposed an Isomap method for discov- 
ering an order among the variables based on their 
correlations. This could lead to block-diagonal or 
banded correlation structures which may help to fix 
a reasonable order before applying the Cholesky de- 
composition; see Section 5. 

As in Section 2.2.2, we use the idea of regression 
to show that T and D can be constructed directly by 
regressing a variable yt on its predecessors. In what 
follows, it is assumed that y is a random vector 
with mean zero and a positive-definite covariance 
matrix S. Let yt be the linear least-squares predictor 
of yt based on its predecessors yt-i, ■ ■ ■ ,yi, and et = 
yt — yt be its prediction error with variance = 
Var(ef). Then, there are unique scalars (ptj so that 



t-i 



(16) 



yt 



^4>tjyj + et, t = i, 



Next, we show how to compute the regression coeffi- 
cients (ptj using the covariance matrix. For a fixed t, 
2 < t < p, set = {(j)ti-, . . . , 4>t,t-iy and let St be the 
{t — 1) X (t— 1) leading principal minor of S and at 
be the column vector composed of the first t — 1 en- 
tries of the tth column of S. Then, from (2) and (3) 



with Yi = (yi 
(17) <Pt = 



yt 

1:^ 



^i) ,Y2 = yt it follows that 



^7'^t 



<ytt 



Gtl^t ^t- 



Let e = (ei , . . . , Sp)' be the vector of successive un- 
correlated prediction errors with Cov(e) =diag((Tf, 
. . . , dp) = . Then, (16) can be rewritten in matrix 
form as e = TY , where T is the following unit lower 
triangular matrix: 



(18) 



1 



^21 
^31 



'Ml 



1 



On2 



1/ 



Now, computing Cov(e) = Cov(Ty) = TT,T' gives 
the modified Cholesky decomposition (15). 

Since the (pij^s in (17) are simply the regression 
coefficients computed from an unstructured covari- 
ance matrix, these coefficients along with log ctj are 
unconstrained (Pourahmadi, 1999, 2000). From (16) 
it is evident that the regression or the orthogonal- 
ization process reduces the task of modeling a co- 
variance matrix to that of a sequence of p varying- 
coefficient and varying-order regression models. 



Thus, one can bring the familiar regression analysis 
machinery to handle the unintuitive task of model- 
ing covariance matrices (Smith and Kohn, 2002; Wu 
and Pourahmadi, 2003; Huang et al., 2006, Huang, 
Liu and Liu, 2007; Bickel and Levina, 2008a; Roth- 
man, Levina and Zhu, 2009). An important conse- 
quence of (15) is that for any estimate {T,D'^) of 
the Cholesky factors, the estimated precision matrix 
S"^ = T'D^^T is guaranteed to be positive definite. 

An alternative form of the Cholesky decomposi- 
tion (14) due to Chen and Dunson (2003), also ob- 
tained from (13), is 

S = DLL'D, 

where L = D~^C is obtained from C by dividing the 
entries of its ith. row by cu. This form has proved 
useful for joint variable selection for fixed and ran- 
dom effects in the linear mixed-effects models, and 
when the focus is on modeling the correlation ma- 
trix; see Bondell, Krishna and Ghosh (2010) and 
Pourahmadi (2007a). 

Some early and implicit examples of the use of the 
Cholesky decomposition in the literature of statis- 
tics include Bartlett's (1933) decomposition of a sam- 
ple covariance matrix, Wright's (1934) path analy- 
sis, Roy's (1958) step-down procedures and Wold's 
(1960) causal chain models which assume the ex- 
istence of an a priori order among the p variables 
of interest. Some of the more explicit uses are in 
Kalman (1960) for filtering of state-space models 
and the Gaussian graphical models (Wermuth, 1980). 
For other uses of Cholesky decomposition in multi- 
variate quality control and related areas see Pourah- 
madi (2007b). 

2.3 GLM for Covariance Matrices 

2.3.1 Linear covariance models The origin of lin- 
ear models for covariance matrices can be traced 
to the work of Yule (1927) and Gabriel (1962) and 
the implicit parameterization of a multivariate nor- 
mal distribution in terms of the entries of either S 
or its inverse. However, Dempster (1972) was the 
first to recognize the entries of = (cr*-') as the 
canonical parameters of the exponential family of 
normal distributions. He proposed to select or esti- 
mate a covariance matrix efficiently and sparsely by 
identifying zeros in its inverse, and referred to the 
procedure as covariance selection models. It fits the 
framework of linear covariance models defined next. 

Motivated by the simple and linear structure of 
covariance matrices of some time series and variance 
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component models, Anderson (1973) introduced the 
class of linear covariance models (LCM): 

(19) J:^^ = aiUi + ■ ■ ■ + aqUg, 

where C/j's are some known symmetric basis matri- 
ces and Oj's are unknown parameters; they must be 
restricted so that the matrix is positive definite. It is 
usually assumed that there is at least a set of coeffi- 
cients where is positive definite. The model (19) 
is rather general, indeed, for q = p'^ any covariance 
matrix admits the representation 

p p 

(20) j: = {a,,) = ^^a,,Uij, 

i=i j=i 

where Uij is a p x p matrix with one on the {i,j)th 
position and zero elsewhere. 

Replacing S by 5 in the left-hand side of (19), it 
can be viewed collection of ^^^^"^^ linear regres- 
sion models. The same regression models viewpoint 
holds with the precision matrix on the left-hand 
side. The class of linear covariance models is om- 
nipresent when dealing with covariance matrices. It 
includes virtually any estimation method that acts 
elementwise on a covariance matrix such as tapering, 
banding, thresholding, covariance selections models, 
penalized likelihood with LASSO penalty on the en- 
tries of the precision matrix, etc.; see (20). 

A major drawback of (19) and (20) is the con- 
straint on the coefficients which could make the esti- 
mation and other statistical problems difficult (An- 
derson, 1973). Szatrowski (1980) gives necessary and 
sufficient conditions for the existence of explicit max- 
imum likelihood estimates, and the convergence of 
the iterative procedure in one iteration from any 
positive-definite starting point. 

A good review of the MLE procedures for the 
model (19) and their applications to the problem 
of testing homogeneity of the covariance matrices 
of several dependent multivariate normals is pre- 
sented in Jiang, Sarkar and Hsuan (1999). They de- 
rive a likelihood ratio test, and show how to com- 
pute the MLE of S, in both the restricted (null) 
and unrestricted (alternative) parameter spaces us- 
ing SAS PROC MIXED software. They also provide 
the code and the implementation is explained using 
several examples. 

The notion of covariance regression introduced by 
Hoff and Niu (2009) is also in the spirit of (19), but 
unlike the LCM the covariance matrix is quadratic 
in the covariates, and positive-definiteness is guar- 
anteed through the special construction. 



2.3.2 Log-linear covariance models A plausible 
way to remove the constraint on aj's in (19) is to 
work with the logarithm of a covariance matrix. The 
key fact needed here is that for a general covariance 
matrix with the spectral decomposition T, = PAP' , 
its matricial logarithm defined by log S = P log AP' 
is a symmetric matrix with unconstrained entries 
taking values in (—00,00). 

This idea has been pursued by Leonard and Hsu 
(1992) and Chiu, Leonard and Tsui (1996) who in- 
troduced the log-linear covariance models for S as 

(21) log^ = aiUi + ----^agUg, 

where f/j's are known matrices as before and the Oj's 
are now unconstrained. However, since logS is a high- 
ly nonlinear operation on S, the a^'s lack statistical 
interpretation (Brown, Le and Zidek, 1994; Liechty, 
Liechty and Miiller, 2004). Fortunately, for S diago- 
nal since logE = diag(logo"ii, . . . ,log(Tpp) is also di- 
agonal, it can be seen that (21) amounts to log-linear 
models for heterogeneous variances which have a long 
history in econometrics and other areas; see Carroll 
and Ruppert (1988) and references therein. 

Maximum likelihood estimation procedures for the 
parameters in (21) and their asymptotic properties 
are studied in Chiu, Leonard and Tsui (1996) along 
with the analysis of two real data sets. Given the 
fiexibility of the log-linear models, one would ex- 
pect them to be used widely in practice, however, 
this does not seem to be the case. An interesting 
application to spatial autoregressive (SAR) models 
and some of its computational advantages are dis- 
cussed in LeSage and Pace (2007). 

2.3.3 GLM via the Cholesky decomposition In this 
section the constraint and lack of interpretation 
of aj's in (19) and (21) are resolved simultaneously 
by relying on the Cholesky decomposition of a co- 
variance matrix described in Section 2.2.4. A bona 
fide GLM for the precision matrix in terms of co- 
variates is introduced and its maximum likelihood 
estimation (MLE) is discussed. An important conse- 
quence of the approach based on the modified Cho- 
lesky decomposition is that for any estimate of the 
Cholesky factors, the estimated precision matrix 

= T'D~^T is guaranteed to be positive definite. 
Recall that for an unstructured covariance ma- 
trix S, the nonredundant entries of its components 
(T, logD^) in (15) are unconstrained. Thus, follow- 
ing the GLM's tradition, one may write parame- 
teric models for them using covariates (Pourahmadi, 
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1999; Pan and MacKenzie, 2003; Zimmerman and 
Nuiiez- Anton, 2010). We consider the following para- 
metric models for (ptj and logfj^, for t = 1, . . . ,p; 
j = l,...,i-l, 

(22) log 0-^=4 A, cl)tj = ztjj. 

Here, Zt,Ztj are q x 1 and d x 1 vectors of known 
covariates, A = (Ai, . . . , A^)' and 7 = (71, ... , 7^)' are 
parameters related to the innovation variances and 
dependence in Y, respectively (Pourahmadi, 1999). 
The most common covariates used in the analysis 
of several real longitudinal data sets (Pourahmadi, 
1999; Pourahmadi and Daniels, 2002; Pan and Mac- 
Kenzie, 2003; Lin and Wang, 2009; Leng, Zhang and 
Pan, 2010) are in terms of powers of times and lags 

zt = {i,t,t\...,t''-'y, 

zt, = (i,t-j,{t-jf,...,{t-jr-^)'. 

A truly remarkable feature of (22) is its flexibil- 
ity in reducing the potentially high-dimensional and 
constrained parameters of S or the precision matrix 
to q + d unconstrained parameters A and 7. Fur- 
thermore, one can rely on graphical tools such as 
the regressogram or AIC to identify models such 
as (22) for the data; for more details see Pourah- 
madi (1999, 2001) and Pan and MacKenzie (2003). 
Liang and Zeger (1986) employ such parametrized 
models for covariance matrices in the context of the 
popular generalized estimating equations for longi- 
tudinal data. 

Computing the MLE of the parameters is rela- 
tively simple due to the special form of the loglikeli- 
hood function for a sample Yi, . . . ,Yn from a normal 
population with mean zero and the common covari- 
ance S parameterized as in (22). Note that, except 
for a constant, we have 

n 

-2/(A,7) = ^(iog|s|+y/s-iy,) 

i=l 

= n\og\D'^\ +ntrJ:~^S 

(23) =nlog\D^\+ntrD-'^TST' 
= nlog \D'^\ 

+ ntTD-^{Ip - B)S{Ip - B)', 

where S = ^ EHi^i^/' B = Ip - T and the last 
three equalities are obtained by replacing for S"^ 
from (15) and some basic matrix operations involv- 
ing trace of a matrix. Since (23) is quadratic in B, for 
a given the MLE of B or (ptj^s has a closed form. 



the same is true of the MLE of for a given B 
(Pourahmadi, 2000; Huang et al., 2006; Huang, Liu 
and Liu, 2007). This simplicity in computing the 
MLE of the saturated (unstructured) model for {T, 
D) is important when comparing the computational 
aspects of Cholesky-based estimation of the preci- 
sion matrix with the Rocha, Zhao and Yu (2008) 
SPLICE algorithm; see Section 3.4. 

An algorithm for computing the MLE of the pa- 
rameters (7, A) using the iterative Newton-Raphson 
algorithm with Fisher scoring is given in Pourah- 
madi (2000) along with the asymptotic properties of 
the estimators. An unexpected finding is the asymp- 
totic orthogonality of the MLE of the parameters A 
and 7, in the sense that their Fisher information 
matrix is block-diagonal; see Pourahmadi (2007a) 
and references therein. When the assumption of nor- 
mality is questionable like when the data exhibit 
thick tails, then a multivariate i-distribution might 
be a reasonable alternative; see Lin and Wang (2009) 
and Lin (2011). 

3. REGULARIZATION OF THE SAMPLE 
COVARIANCE MATRIX 

This section is devoted to high-dimensional data 
where the sample covariance matrix is known to 
be a poor estimator, and not even invertible when 
n. We review some alternative and improved 
estimators obtained by regularizing the sample co- 
variance matrix in various ways. After presenting 
a few loss functions in Section 3.1, we review in Sec- 
tions 3.2 and 3.3 shrinkage estimators obtained by 
minimizing certain risk functions. An early and in- 
spiring example is the Stein's family of shrinkage 
estimators that shrinks the eigenvalues of the sam- 
ple covariance matrix toward a central value. Pe- 
nalized normal likelihood estimators with a LASSO 
penalty on the precision matrix are reviewed in Sec- 
tion 3.4 with a focus on the graphical LASSO al- 
gorithm. Regularization methods which act elemen- 
twise on the sample covariance matrix such as ta- 
pering, banding and thresholding are discussed in 
Section 3.5. Some conditions for consistency of such 
estimators are also reviewed. 

3.1 Some Loss and Risk Functions 

Regularized estimators are usually obtained by 
minimizing suitable norms, risks or objective func- 
tions. For covariance matrix estimation the Frobe- 
nius and operator (spectral) norms are quite nat- 
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ural and have proved useful in establishing theo- 
retical properties of covariance estimators. For ex- 
ample, consistency in operator norm guarantees the 
consistency of the eigenstructure used in principal 
component analysis (Johnstone and Lu, 2009) and 
other related methods in multivariate statistics; see 
Section 3.5. 

The two commonly used loss functions when n> p 
are 

Li(S,S) =tr(SS^^) - log|SS"^| -p, 

L2(S,S)=tr(SS-^-/)2, 

where S = S(S') is an arbitrary estimator. The cor- 
responding risk functions are 

Ri{t,J:) = Ej:L,{t,^), i = l,2. 

An estimator S is considered better than S if its risk 
function is smaller than that of S. The loss func- 
tion Li was advocated by Stein (1956) and is usu- 
ally called the entropy loss or the Kullback-Liebler 
divergence of two multivariate normal densities cor- 
responding to the two covariance matrices. The sec- 
ond, called a quadratic loss function, is essentially 
the Euclidean or the Frobenius norm of its matrix 
argument which involves squaring the difference be- 
tween aspects of the estimator and the target. Con- 
sequently, it penalizes overestimates more than un- 
derestimates, and "smaller" estimates are more fa- 
vored under L2 than under Li. For example, among 
all scalar multiples aS of the sample covariance ma- 
trix, it is known (Haff, 1980) that S is optimal un- 
der Li, while the smaller estimator ^^^-^ is optimal 
under L2. 

Following the lead of Muirhead and Leung (1987), 
Ledoit and Wolf (2004) have used a slight modifica- 
tion of the Frobenius norm as the loss function 

L3(S, S) = p~^\\± - =p-i tr(S - S)2. 

Note that though dividing by the dimension p is 
not standard, it has the advantage that norm of the 
identity matrix is one, regardless of the size of p. 
Also, the loss L3 does not involve matrix inversion 
which is ideal with regard to computational cost for 
the "small n, large p" case. The heuristics behind 
this loss function are the same as those for L2 ■ How- 
ever, it has an additional and attractive feature that 
the optimal covariance estimator under L3 turns 
out to be the penalized normal likelihood estima- 
tor with trS~^ as the penaly (Warton, 2008; Yuan 
and Huang, 2009). Since the penalty function be- 
comes large when S gets closer to singularity, such 
a penalty forces the covariance estimators to be non- 
singular and better conditioned. 



3.2 Shrinking the Spectrum and the Correlation 
Matrix 

In this section we present one of the earliest im- 
provements of 5 obtained by shrinking only its eigen- 
values. Having observed that the sample covariance 
matrix systematically distorts the eigenstructure 
of S, particularly when ^ is large. Stein (1956, 1975) 
initiated the task of improving it. He considered or- 
thogonally invariant estimators of the form 

where A = (Ai, . . . , Xp)' , Ai > • • • > Ap > 0, are the or- 
dered eigenvalues of S, and P is the orthogonal ma- 
trix whose jth column is the normalized eigenvector 
of S corresponding to \j , and $(A) = diag((/5i , . . . , ipp) 
is a diagonal matrix where ipj = estimates the 

jth largest eigenvalue of S. For example, the choice 
of ipj = Xj corresponds to the usual unbiased estima- 
tor S, where it is known that Ai and Xp have upward 
and downward biases, respectively. Stein's method 
chooses <J>(A) so as to counteract the biases of the 
eigenvalues of S by shrinking them toward some cen- 
tral values. For the Li risk, his modified estimators 
of the eigenvalues of S are ifj = where 

Oj = aj{X) = n - p + 1 + 2Xj V] — . 

Xj — Xi 

Note that the ipj 's will differ the most from Xj when 
some or all of the Aj's are nearly equal and ^ is not 
small. Since some of the (pj^s could be negative and 
may not even satisfy the order restriction, Stein has 
suggested an isotonizing procedure to obtain modi- 
fied estimators satisfying the above constraints; for 
more details on this procedure see Lin and Perlman 
(1985). 

Lin and Perlman (1985) have applied the James- 
Stein shrinkage estimators (James and Stein, 1961) 
to the sample correlation in order to improve it for 
large p. They shrink the Fisher z-transform of the in- 
dividual correlation coefficients (and the logarithm 
of the variances) toward a common target value. 

3.3 Ledoit-Wolf Shrinkage Estimator 

To ensure nonsingularity of the estimated covari- 
ance matrix in the "n small, p large" case, Ledoit 
and Wolf (2004) present a shrinkage estimator that 
is asymptotically the optimal convex linear combi- 
nation of the sample covariance matrix and the iden- 
tity matrix with respect to L3. 
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One can motivate such an estimator by recall- 
ing that the sample covariance matrix S is unbi- 
ased for S, but unstable with considerable risk when 
p^n. By contrast, a structured covariance matrix 
estimator like the identity matrix has very little es- 
timation error, but can be severely biased when the 
structure is misspecified. A natural compromise be- 
tween these two extremes is a linear combination of 
them, giving a simple shrinkage or ridge candidate 
of the form 

S = ail + a2S. 

Now, one may choose ai and a2 to optimize certain 
criterion (Ledoit and Wolf, 2004). 

Using the Frobenius norm or minimizing the risk 
corresponding to the loss function L3, Ledoit and 
Wolf (2004) showed that the optimal choices of ai 
and a2 depend only on the following four-dimensio- 
nal aspects of the true (but unknown) covariance 
matrix S: 

/u = tr(S)/p, a2 = ||S-^/f, 

/32 = E||5-Sf, 6^ = E\\S-filf. 

Consistent estimators of these low-dimensional pa- 
rameters are provided by Ledoit and Wolf (2004), so 
that substitution in S results in a positive-definite 
estimator of T,. Through extensive simulation stud- 
ies they establish the superiority of this estimator 
to the sample covariance matrix and the empirical 
Bayes estimator (Half, 1980), among others. 

Warton (2008) taking 02 = 1 showed that such 
ridge estimators can be obtained using the penal- 
ized normal likelihood where the penalty term is 
proportional to trS~^. Evidently, such a penalty 
ensures that the estimator is a nonsingular matrix. 
He suggests using the cross-validation of the likeli- 
hood function for estimation of the ridge and the 
penalty parameters, and extends the approach to 
the ridge estimation of the correlation matrix. His 
method of estimation leads to the definition of suit- 
able test statistics for the parameters in multivariate 
linear regression in high-dimensional situations. The 
power properties of the test statistic are studied and 
compared with the principal components and gen- 
eralized inverse test statistics used in dealing with 
high dimensionality. 

3.4 The Penalized Likelihood Approach 

In this section we review various regularization 
methods based on penalizing the normal likelihood. 
These methods differ mostly on the LASSO penalty 



imposed on certain segments of the precision ma- 
trix. For example, Huang et al. (2006), Banerjee, 
El Ghaoui and d'Aspremont (2008), Friedman, Has- 
tie and Tibshirani (2008), Rothman et al. (2008) 
and Warton (2008), respectively, impose penalty on 
the Cholesky factor, all the entries, off-diagonal en- 
tries and the diagonal entries of the precision ma- 
trix. These can be viewed as methods for solving 
Dempster's (1972) covariance selection problem of 
inducing sparsity in the precision matrix. However, 
Warton's (2008) penalty leads to the Ledoit-Wolf 
estimator where neither T, nor its inverse is sparse. 

Motivated by the success of the LASSO estima- 
tors in the context of linear regression with a large 
number of covariates (Tibshirani, 1996), and in view 
of (19) and (20), it is plausible to induce sparsity in 
the precision matrix estimate by adding to the nor- 
mal loglikelihood (23) a penalty on the entries of the 
precision matrix or its Cholesky factor (Huang 
et al., 2006) 

(24) -2l + J2p>^^Mn^ 

i<j 

where a^^ is the (i,j)th entry of the precision ma- 
trix and Xij is the corresponding tuning parame- 
ter. Note that the LASSO penalty corresponds to 
Pa (1^1) = A|x|. Such an approach will inherit many 
desirable computational and statistical properties 
of LASSO and its many improved variants (Efron 
et al., 2004; Rocha, Zhao and Yu, 2008; Fan and Lv, 
2010, Section 3.5). 

Some early attempts at inducing sparsity in the 
precision matrix are Bilmes (2000), Smith and Kohn 
(2002), Wu and Pourahmadi (2003) and Levina, 
Rothman and Zhu (2008) who, for a fixed order 
of the variables in Y, use a parametrization of the 
precision matrix in terms of the modified Cholesky 
decomposition (15). Covariance selection priors and 
AIC were used to promote sparsity in T. Huang 
et al. (2006) proposed a covariance selection esti- 
mator by adding to the normal loglikelihood the 
LASSO penalty on the off-diagonal entries of T, and 
cross-validation was used to select a common regu- 
larization parameter; see also Huang, Liu and Liu 
(2007) and Levina, Rothman and Zhu (2008) for 
some improvements. Bickel and Levina (2008a) pro- 
vide conditions ensuring consistency in the operator 
norm for the precision matrix estimates based on 
banded Cholesky factors. 

Chang and Tsay (2010) extend the Huang et al. 
(2006) setup using an equi-angular penalty which 
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imposes different penalty on each row of T and the 
penalties are inversely proportional to the prediction 
variance of the ith regression. Extensive simu- 
lations were used to compare the performance of 
their method with others, including the sample co- 
variance matrix, banding (Bickel and Levina, 2008a) 
and the Li-penalized normal loglikelihood (Huang 
et al., 2006). Contrary to the banding method, the 
method of Huang et al. and the equi-angular method 
worked reasonably well for six covariance matrices, 
with the equi-angular method outperforming the oth- 
ers. Since the modified Cholesky decomposition is 
not permutation-invariant, they also use a random 
permutation of the variables before estimation to 
study the sensitivity to permutation of each method. 
They conclude that permuting the variables intro- 
duces some difficulties for each estimation method, 
except the sample covariance matrix, but the equi- 
angular method remains the best, with the banding 
method having the worst sensitivity to permutation. 
They also compare these methods by applying them 
to a portfolio selection problem with p = 80 series of 
actual daily stock returns. 

Two disadvantages of imposing sparsity on the fac- 
tor T are that its sparsity does not necessarily imply 
sparsity of the precision matrix, and the sparsity 
structure in T could be sensitive to the order of the 
random variables within Y . Some alternative meth- 
ods which tackle these issues penalize the precision 
matrix directly. For example, Banerjee, El Ghaoui 
and d'Aspremont (2008), Yuan and Lin (2007) and 
Friedman, Hastie and Tibshirani (2008) consider an 
estimate defined by the normal loglikelihood penali- 
zed by the Li-norm of the entries of These me- 
thods produce sparse, permutation-invariant estima- 
tors of the precision matrix, though some are com- 
putationally expensive. Yuan and Lin (2007) used 
the max-det algorithm to compute the estimator 
while imposing the positive-definiteness constraint; 
this seems to have limited their numerical results to 
j9 < 10 (Rothman et al. 2008, page 496). 

To date, the fastest available algorithm is the gra- 
phical lasso (glasso), proposed by Friedman, Hastie 
and Tibshirani (2008). It relies on the equivalence 
of the Banerjee, El Ghaoui and d'Aspremont (2008) 
blockwise interior point procedure and recursively 
solving and updating a series of LASSO regression 
problems using the coordinate descent algorithm for 
LASSO. Fortunately, the sparse covariance estima- 
tor from the graphical LASSO is guaranteed to be po- 
sitive definite. This important property follows from 



a result due to Banerjee, El Ghaoui and d'Aspremont 
(2008) showing that if the iterative procedure is ini- 
tialized with a positive-definite matrix, then the sub- 
sequent iterates remain positive definite. 

The sparse pseudo-likelihood inverse covariance 
estimation (SPLICE) algorithm of Rocha, Zhao and 
Yu (2008) and the SPACE (Sparse PArtial Correla- 
tion Estimation) algorithm of Peng et al. (2009) also 
impose sparsity constraints directly on the precision 
matrix, but with slightly different regression-based 
reparameterizations of S"-"^; see (7) and (9). They 
are designed to improve several shortcomings of the 
approach of Meinshausen and Biihlmann (2006), in- 
cluding its lack of symmetry for neighborhood se- 
lection in Gaussian graphical models. While Mein- 
shausen and Biihlmann (2006) use p separate lin- 
ear regressions to estimate the neighborhood of one 
node at a time, Rocha et al. and Peng et al. propose 
merging all p linear regressions into a single least 
squares problem where the observations associated 
to each regression are weighted according to their 
conditional variances. 

To appreciate the need for using approximate or 
pseudo- likelihood, it is instructive to note that un- 
like the sequence of prediction errors in (16), the e^'s 
from Section 2.2.2 are correlated so that l)^ is not 
really the covariance matrix of the vector of regres- 
sion errors e = (ei, . . . , £p)' . The use of its true and 
full covariance matrix in the normal loglikelihood 
would increase the computational cost at the estima- 
tion stage. This problem is circumvented in Rocha, 
Zhao and Yu (2008) and Friedman, Hastie and Tib- 
shirani (2010) by using a pseudo-likelihood function 
which in the normal case amounts to pretending 
that the Cov(e) is D^. To this pseudo- loglikelihood 
function, they add the symmetry constraints (8) and 
a weighted LASSO penalty on the off-diagonal en- 
tries to promote sparsity. A drawback of the SPLICE 
and SPACE algorithms is that they do not enforce 
the positive-definiteness constraint, hence, the re- 
sulting covariance estimators are not guaranteed to 
be positive definite. 

The sparsistency and rates of convergence for spar- 
se covariance and precision matrix estimation us- 
ing the penalized likelihood with nonconvex penalty 
functions have been studied in Lam and Fan (2009). 
Sparsistency refers to the property that all zero en- 
tries are actually estimated as zero with probability 
tending to one. In a given situation, sparsity might 
be present in the covariance matrix, its inverse or 
Cholesky factor. They develop a unified framework 
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to study these three sparsity problems with a general 
penalty function and show that the rates of conver- 
gence for these problems under the Frobenius norm 
are of the order C^'"^^ )^/^, where s = s„ is the num- 
ber of nonzero elements, p = pn is the size of the 
covariance matrix and n is the sample size. This re- 
veals that the contribution of high-dimensionality is 
merely of a logarithmic factor. 

3.5 Elementwise Shrinkage 

In this section we review a few alternative estima- 
tors like banding, tapering and thresholding which 
are based on the elementwise shrinkage of the sam- 
ple covariance matrix. These covariance estimators 
require a minimal amount of computation, except in 
the cross-validation for selecting the tuning param- 
eter which is computationally comparable to that 
for the penalized likelihood method. However, due 
to their emphasis on elementwise transformations, 
such estimators are not guaranteed to be positive 
definite. 

3.5.1 Banding and tapering the sample covariance 
matrix Many entries of the sample covariance ma- 
trix S = (sij) could be small or unstable in the 
"n small, p large" case. The most extreme case of 
this occurs in time series analysis where one has to 
work with only a single (long) realization {n = 1). 
The requirement of stationarity reduces the number 
of distinct entries of the p x p covariance matrix S 
from p{p + l)/2 to just p, which is still large. The 
moving average (MA) and autoregressive (AR) mod- 
els which further reduce the number of parameters 
are the prototypes of banding a covariance/precision 
matrix (Bickel and Levina, 2004; Wu and Pourah- 
madi, 2009; McMurry and Politis, 2010). 

Given the sample covariance matrix S = (sij) and 
any integer k, < k < p, its /c-banded (Bickel and 
Levina, 2008a) version defined by 

Bk{S) = [sijl{\i-j\<k)] 

can serve as an estimator for S . This kind of regular- 
ization is ideal when the indices have been arranged 
so that 

\i — j\ > k =^ (7jj = 0. 

This occurs, for example, if yi,y2, ■ ■ ■ , form a finite 
inhomogenous moving average process 

k 

yt = ^Ot,t-jej, 
i=i 

and ej's are i.i.d. with mean and finite variances. 



Banding is a special case of tapering which repla- 
ces S hy S * R, where (*) denotes the Schur (coordi- 
nate-wise) matrix multiplication and R = {rij) is 
a positive-definite symmetric matrix (Furrer and 
Bengtsson, 2007). It is known that the Schur prod- 
uct of two positive-definite matrices is also positive 
definite. Banding corresponds to using R = rij = 
{l{\i — j\ < k), which is not a positive-definite ma- 
trix. The idea of banding has also been used on the 
lower triangular matrix of the Cholesky decomposi- 
tion of S"^ by Wu and Pourahmadi (2003), Huang 
et al. (2006) and Bickel and Levina (2008a). While 
Furrer and Bengtsson (2007) have used tapering as 
a regularization technique for the ensemble Kalman 
filter, Kaufman, Schervish and Nychka (2008) use it 
for purely computational purposes in the likelihood- 
based estimation of the parameters of a structured 
covariance function for large spatial data sets. 

Asymptotic analysis of banding is possible when n, 
p and k are large. Bickel and Levina [(2008a), Theo- 
rems 1 and 2] have shown that, for normal data, the 
banded estimator is consistent in the operator norm 
(spectral norm), uniformly over a class of approx- 
imately "bandable" matrices, as long as — t- 0. 
They obtain explicit rate of convergence which de- 
pends on how fast A; — )• oo; see also Cai, Zhang and 
Zhou (2010). The consistency in operator norm guar- 
antees the consistency of principal component anal- 
ysis (Johnstone and Lu, 2009) and other related 
methods in multivariate statistics when n is small 
and p is large. Cai, Zhang and Zhou (2010) pro- 
pose a tapering procedure for the covariance matrix 
estimation and derive the optimal rate of conver- 
gence for estimation under the operator norm. They 
also carry out a simulation study to compare the fi- 
nite sample performance of their proposed estima- 
tor with that of the banding estimator introduced 
in Bickel and Levina (2008a). The simulation shows 
that their proposed estimator has good numerical 
performance, and nearly uniformly outperforms the 
banding estimator. 

3.5.2 Thresholding the sample covariance matrix 
When both n and p are large, it is plausible that 
many elements of the population covariance matrix 
are equal to 0, and, hence, S is sparse. How does 
one develop an estimator other than S to cope with 
this situation? The concept of thresholding origi- 
nally developed in nonparametric function estima- 
tion has been used in the estimation of large co- 
variance matrices by Bickel and Levina (2008b), El 
Karoui (2008a, 2008b) and Rothman, Levina and 
Zhu (2009). 
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For a sample covariance matrix S = {sij) the thres- 
holding operator Tg for s > is defined by 

Ts{S) = [sijl{\sij \ > s)]. 

Thus, thresholding 5 at s amounts to replacing by 
zero all entries with absolute value less than s. Its 
biggest advantage is its simplicity, as it carries no ma- 
jor computational burden compared to its competi- 
tors like the penalized likelihood with the LASSO 
penalty (Huang et al., 2006; Rothman et al., 2008; 
Friedman, Hastie and Tibshirani, 2008). A potential 
disadvantage is the loss of positive-definiteness as in 
banding. However, just as in banding, Bickel and 
Levina (2008b) have established the consistency of 
the threshold estimator in the operator norm, uni- 
formly over the class of matrices that satisfy a no- 
tion of sparsity, provided that — t- 0. An imme- 
diate consequence of the consistency result is that 
a threshold estimator will be positive definite with 
probability tending to one for large samples and di- 
mensions. 

4. BAYESIAN MODELING OF COVARIANCES 

Heuristically, there is an implicit equivalence be- 
tween regularization and Bayesian estimation in sta- 
tistics. This can be seen by suitable exponentiation 
of the penalty term in (24) and viewing it as a prior 
on the parameter space, or conversely by viewing 
a prior as a means of imposing constraints on the 
parameters. 

Traditionally, in Bayesian approaches to inference 
for S the Jefferys' improper prior and the conju- 
gate inverse Wishart (IW) priors are used. For some 
reviews of the earlier work in this direction, see Lin 
and Perlman (1985) and Brown, Le and Zidek (1994). 
However, the success of Bayesian computation and 
Markov Chain Monte Carlo (MCMC) in the late 
1980s did open up the possibility of using more flexi- 
ble and elaborate nonconjugate priors for covariance 
matrices; see Leonard and Hsu (1992), Yang and 
Berger (1994), Daniels and Kass (1999) and Hoff 
(2009). We present a brief review of the progress 
in Bayesian covariance estimation in a somewhat 
chronological order starting with priors put on the 
components of the spectral decomposition. 

4.1 Priors on the Spectral Decomposition 

Starting with the remarkable work of Stein (1956, 
1975), efforts to improve estimation of a covariance 
matrix have been confined mostly to shrinking the 
eigenvalues of the sample covariance matrix toward 



a common value (Dey and Srinivasan, 1985; Lin and 
Perlman, 1985; Half, 1991; Yang and Berger, 1994; 
Daniels and Kass, 1999; Hoff, 2009). Such covariance 
estimators have been shown to have lower risk than 
the sample covariance matrix. Intuitively, shrinking 
the eigenvectors is expected to further improve or 
reduce the estimation risk (Daniels and Kass, 1999, 
2001; Johnstone and Lu, 2009). 

There are three broad classes of priors that are 
based on unconstrained parameterizations of a co- 
variance matrix using its spectral decomposition. 
These have the goal of shrinking some functions of 
the off-diagonal entries of S or the corresponding 
correlation matrix toward a common value like zero. 
Consequently, estimation of the dependence 
parameters is reduced to that of estimating a few 
parameters. 

Perhaps, the first breakthrough with the GLM 
principles in mind is the log matrix prior due to 
Leonard and Hsu (1992) which is based on the ma- 
tricial logarithm defined in Section 2.2.3. Thus, for- 
mally a multivariate normal prior with a large num- 
ber of hyperparameters is introduced. They show 
the flexibility of this class of priors for the covari- 
ance matrix of a multivariate normal distribution, 
yielding much more general hierarchical and empiri- 
cal Bayes smoothing and inference, when compared 
with a conjugate analysis involving an IW prior. The 
prior is not conditionally conjugate, and according 
to Brown, Le and Zidek (1994), its major drawback 
is the lack of statistical interpretability of the entries 
of log S and their complicated relations to those of S 
as seen in Section 2.2.3. Consequently, prior elicita- 
tion from experts and substantive knowledge cannot 
be used effectively in arriving at priors for the en- 
tries of log E and their hyperparameters; see Liechty, 
Liechty and Miiller [(2004), page 2] for a discussion 
on the lack of intuition and relationship between 
log-eigenvalues and correlations. 

The reference (noninformative) prior for a covari- 
ance matrix in Yang and Berger (1994) is of the form 



i<j 



A,) 



where Ai > A2 > • • • > Ap are the ordered eigenvalues 
of E and c is a constant. Yang and Berger [(1994), 
page 1199] note that compared to the Jeffreys prior, 
the reference prior puts considerably more mass near 
the region of equality of the eigenvalues. Therefore, 
it is intuitively plausible that the reference prior 
would produce a covariance estimator with better 
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eigenstructure shrinkage. Furthermore, they point 
out that the reference priors for and the eigen- 
values of the covariance matrix are the same as p{T,). 
Expression for the Bayes estimator of the covari- 
ance matrix using this prior involves computation 
of high-dimensional posterior expectations; the com- 
putation is done using the hit-and-run sampler in 
a Markov chain Monte Carlo setup. An alternative 
noninformative reference prior for S (and the pre- 
cision matrix) which allows for closed-form poste- 
rior estimation is given in Rajaratnam, Massam and 
Carvalho (2008). 

It is known (Daniels, 2005) that the Yang and 
Berger's (1994) reference prior implies a uniform 
prior on the orthogonal matrix P and flat improper 
priors on the logoarithm of the eigenvalues of the co- 
variance matrix. The shrinkage priors of Daniels and 
Kass (1999) also rely on the spectral decomposition 
of the covariance matrix, but are designed to shrink 
the eigenvectors by reparametrizing the orthogonal 
matrix in terms of ^^^^^^ Givens angles (Golub and 
Van Loan, 1989) between pairs of the columns of 
the orthogonal matrix P. Since 6 is restricted to lie 
in the interval (— 7r/2, 7r/2), a logit transform will 
make it unconstrained so as to conform to the GLM 
principles. They put a a mean-zero normal prior on 
the logit tranformation of the Givens angles. The 
statistical relevance and interpretation of the Givens 
angles are not well understood at this time. The lo- 
cal parametrization of orthogonal matrices in Boik 
(2002) could shed some light on the problem of in- 
terpretation of the new parameters. The idea of in- 
troducing matrix Bingham distributions as priors on 
the group of orthogonal matrices (Hoff, 2009) could 
also be useful in shrinking the eigenvectors of the 
sample covariance matrix. 

Using simulation experiments, Yang and Berger 
(1994) compared the performance of their reference 
prior Bayes covariance estimator to the covariance 
estimators of Stein (1975) and Haff (1991) and found 
it to be quite competitive based on the risks corre- 
sponding to the loss functions Lj,i = 1,2. Daniels 
and Kass (1999), also using simulations, compared 
the performance of their shrinkage estimator to sev- 
eral other Bayes estimators of covariance matrices, 
using only the risk corresponding to the Li loss func- 
tion. It turns out that the Bayes estimators from 
the Yang and Berger's (1994) reference prior do as 
well as those from the Givens-angle prior for some 
nondiagonal and ill-conditioned matrices, but suf- 
fers when the true matrix is diagonal and poorly 
conditioned. 



4.2 The Generalized Inverse Wishart Priors 

The use of Cholesky decomposition of a covari- 
ance matrix or the regression dissection of the asso- 
ciated random vector has a long history and can be 
traced at least to the work of Bartlett (1933); see Liu 
(1993). It is shown by Brown, Le and Zidek (1994) 
that a regression dissection of the inverse Wishart 
(IW) distribution reveals some of its noteworthy fea- 
tures, making it possible to define flexible general- 
ized inverted Wishart (GIW) priors for general co- 
variance matrices. 

These priors are constructed by first partitioning 
a multivariate normal random vector Y with mean 
zero and covariance matrix S into k <p subvectors: 
Y = {Zi, . . . , Zk)' , and writing its joint density as 
the product of a sequence of conditionals: 

f{y) = f{zi)fiz2\zi) • • • f{Zk\Zk^l, Zi). 

Now, in each conditional distribution one places nor- 
mal prior distributions on the regression coefficients 
and inverse Wishart on the prediction variances. 
The hyperparameters can be structured so as to 
maintain the conjugacey of the resulting priors. It 
is known (Daniels and Pourahmadi, 2002; Rajarat- 
nam, Massam and Carvalho, 2008) that such priors 
offer considerable flexibility, as there are many pa- 
rameters to control the variability in contrast to the 
one parameter for IW. 

These ideas and techniques have been further re- 
fined in Garthwaite and Al-Awadi (2001) in prior 
distribution elicitation from experts, and extended 
to longitudinal and panel data setup in Daniels and 
Pourahmadi (2002) and Smith and Kohn (2002). 
The GIW prior was further refined in Daniels and 
Pourahmadi (2002) using the finest partition of Y, 
that is, using k = p. In this case all restrictions on 
the hyperparameters are removed from the normal 
and inverse Wishart (gamma) distributions and the 
prior remains conditionally conjugate, in the sense 
that the full-conditional of the regression coefficients 
is normal given the prediction variances, and the full- 
conditional of prediction variances is inverse gamma 
given the regression coefficients. For a review of cer- 
tain advantages of this approach in the context of 
longitudinal data and some examples of analysis of 
such data, see Daniels (2005) and Daniels and Hogan 
(2008). 

4.3 Priors on Correlation Matrices 

One of the first uses of variance-correlation decom- 
position in Bayesian covariance estimation seems to 
be due to Barnard, McCulloch and Meng (2000), 
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who, using p{T,) =p{D,R) = p{D)p{R\D), introdu- 
ced independent priors for the standard deviations 
in D and the correlations in R. 

Specifically, they used log normal priors on vari- 
ances independently of a prior on the whole ma- 
trix R. The latter is capable of inducing uniform 
(—1,1) priors on the entries pij of the correlation ma- 
trix R; see Liu and Daniels (2006). This is done by 
first deriving the marginal distribution of R when S 
has a standard IW distribution, Wp^ll,^),!/ > p, 
with the density 

fpi^lu) = c|sr(^/2){^+p+i) g^p(_ 1 

It turns out that 

1=1 

where Ru is the principal submatrix of R, obtained 
by deleting its ith. row and column. Then, using the 
marginalization property of the IW (i.e., a principal 
submatrix of an IW is still an IW), the marginal 
distribution of each pij, i ^ j, is obtained as 

f{p,,w)=c{i-pi)^''-p-'y', |p.,i<i. 

The latter can be viewed as a Beta (^^^^f^, ^^2~^^ ) 
on ( — 1, 1), which is uniform when u = p + 1. More- 
over, by choosing p <u <p+ \ or v > p+\, one can 
control the tail of f{pij\u), that is, making it heavier 
or lighter than the uniform. Thus, the above family 
of priors for R is indexed by a single "tuning" pa- 
rameter V. 

Liechty, Liechty and Miiller (2004) note that few 
existing probability models and parameterizations 
for covariance matrices allow for easy interpreta- 
tion and prior elicitation. They propose priors in 
which correlations are grouped based on similarities 
among the correlations or based on groups of vari- 
ables. A good example of this situation is in financial 
time series where it is often known that returns of 
stocks in the same industries are more closely related 
than others. 

4.4 Reparameterization via Partial 
Autocorrelations 

In this section we present yet another unconstrai- 
ned and statistically interpretable reparameteriza- 
tion of S, but now using the notion of partial auto- 
correlation function (PACF) from time series anal- 
ysis (Box, Jenkins and Reinsel, 1994; Pourahmadi, 
2001, Chapter 7). As expected, this approach, just 
like the Cholesky decomposition, requires an a priori 
order among the random variables in Y . It is moti- 
vated by and tries to mimic the phenomenal success 



of the PACF of a stationary time series in model 
formulation (Box, Jenkins and Reinsel, 1994) and 
removing the positive-definiteness constraint on the 
autocorrelation function (Ramsey, 1974). We note 
that reparameterizing the stationarity-invertibility 
domain of ARMA models by Jones (1980) had a pro- 
found impact on algorithms for computing the MLE 
of the ARMA coefficients and guaranteeing that the 
estimates are in the feasible region. 

Starting with the variance-correlation decompo- 
sition, we focus on reparameterizing the correlation 
matrix R = (pij) in terms of entries of a simpler sym- 
metric matrix 11 = (iTij), where tth = 1 and for i < j, 
TTij is the partial autocorrelation between yi and yj 
adjusted for the intervening (not the remaining) 
variables. More precisely, vrj^j+i = pi^i^i,i = 1,..., 
p — 1, are the lag-1 correlations and for j — i>2, 
T^ij = Pij\i+i,...,j-i in the notation of Anderson (2003), 
page 41. Note that, unlike i?, and the matrix of full 
partial correlations (p*-') constructed from in 
Section 2.2.2, 11 has a much simpler structure in that 
its entries are free to vary in the interval (—1,1). 
If needed, using the Fisher z-transform 11 can be 
mapped to the matrix 11 where its off-diagonal en- 
tries take values in the entire real line (— oo, -|-oo). 

Compared to the long history of using the PACF 
in time series analysis (Quenouille, 1949), research 
on establishing a one-to-one correspondence between 
a general covariance matrix and (-D,n) has a rather 
short history. An early work in the Bayesian context 
is due to Eaves and Chang (1992), followed by Zim- 
merman (2000) and Pourahmadi [(1999, 2001), pa- 
ge 102] for longitudinal data, Degerine and Lambert- 
Lacroix (2003) for the nonstationary time series, 
and Kurowicka and Cooke (2003) and Joe (2006) for 
a general random vector. The fundamental determi- 
nantal identity, 

(25) isi=fn--)rin(i-4)' 

\j=l / j=2j=l 

has been redicovered recently by Degerine and Lam- 
bert-Lacroix (2003), Kurowicka and Cooke (2003) 
and Joe (2006), but its origin can be traced to a no- 
table and somewhat neglected paper of Yule (1907), 
equation (25). 

The identity (25) plays a central role in Joe's (2006) 
method of generating random correlation matrices 
whose distributions are independent of the order of 
variables in Y . It is used in Daniels and Pourahmadi 
(2009) to introduce priors for the Bayesian analysis 
of correlation matrices. These papers employ inde- 
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pendent linearly transformed Beta priors on ( — 1,1) 
for the partial autocorrelations vTjj. However, Jones 
(1987) seems to be the first to use such Beta priors 
in simulating data from "typical" ARMA models. 

5. CONCLUSIONS 

We have reviewed progress in covariance estima- 
tion for low- and high-dimensional data, from the 
narrow perspectives of the GLM and regularization 
or parsimony and sparsity. Recent appearance of 
many regression-based techniques and the use of 
LASSO-type penalties show that covariance estima- 
tion can benefit greatly from mimicking/using the 
conceptual and computational tools of regression ana- 
lysis. Fortunately, mostly due to the computational- 
algorithmic advances centered around LASSO, the 
high-dimensionality challenge in covariance estima- 
tion has been become manageable, however, the posi- 
tive-definiteness challenge still remains. Its removal 
could not only further reduce the computational cost 
due to high-dimensionality, but is also crucial for par- 
simony and writing simple, interpretable models us- 
ing covariates. Among the three matrix decomposi- 
tions, the spectral and Cholesky decompositions are 
the most helpful in removing the positive-definiteness 
constraint. These along with some recent covariance 
estimation algorithms enforcing the positive-defini- 
teness suggest that there are trade-offs among the 
requirements of unconstrained parameterization, sta- 
tistical interpretability and the computational cost. 

In summary, the problem of removing the positive- 
definiteness constraint remains open, in the sense 
that, as yet, no unconstrained and statistically in- 
terpretable reparameterization exists for a general 
covariance matrix without imposing an order on the 
variables. 
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